A Corpus-Based Statistical Approach to Automatic Book Indexing

Authors

  • Jyun-Sheng Chang
  • Tsung-Yih Tseng
  • Sur-Jin Ker
  • Ying Cheng
  • Huey-Chyun Chen
  • Shun-Der Cheng
  • John S. Liu
Abstract

The paper reports on a new approach to the automatic generation of back-of-book indexes for Chinese books. Full sentential parsing is avoided because it is inefficient and because no Chinese grammar with sufficient coverage is available. Instead, word segmentation, a fundamental analysis particular to Chinese text, is performed to break a string of characters into a sequence of lexical units equivalent to words in English. The sequence of words then goes through part-of-speech tagging and noun phrase analysis. All of these analyses are done using a corpus-based statistical algorithm. Experiments have shown satisfactory results.
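
The abstract does not say which statistical model drives the word segmentation step, so the following is only a minimal sketch of one common corpus-based technique: pick the segmentation that maximizes the product of unigram word probabilities, found by dynamic programming. The word counts, the unknown-character penalty, and the function names are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical word counts standing in for a real segmented training corpus.
WORD_COUNTS = {
    "研究": 8, "生命": 6, "研究生": 3, "命": 1, "起源": 5, "的": 20, "生": 1,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)
# Assumed penalty for a single character never seen as a word in the corpus.
UNKNOWN_LOGP = math.log(0.01 / TOTAL)


def word_logp(word):
    """Log relative frequency of a candidate word, or None if it is not allowed."""
    if word in WORD_COUNTS:
        return math.log(WORD_COUNTS[word] / TOTAL)
    if len(word) == 1:
        return UNKNOWN_LOGP  # fall back to a single unknown character
    return None


def segment(text):
    """Return the word sequence with the highest total unigram log probability."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            lp = word_logp(text[start:end])
            if lp is None:
                continue
            score = best[start] + lp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("研究生命的起源"))
```

With these toy counts, the ambiguous string "研究生命的起源" is split as 研究 / 生命 / 的 / 起源 rather than 研究生 / 命 / 的 / 起源, because the first reading has the higher product of word frequencies. In a pipeline like the one the abstract describes, the resulting word sequence would then be passed on to part-of-speech tagging and noun phrase analysis.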

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Briefly Noted

With the explosion in the quantity of online information in recent years, automatic abstracting and indexing has received renewed interest and a number of promising approaches have emerged. The goal of this book is to present a complete description of current indexing and abstracting techniques in the context of the underlying linguistic and statistical knowledge. The book has three parts: the ...

Corpus-Based Learning Of Compound Noun Indexing

In this paper, we present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method shows about the same performance compared with ...

n-grams of Seeds: A Hybrid System for Corpus-Based Text Summarization

This paper presents a hybrid system for automatic text summarization which combines statistical and knowledge-based methods. In particular, it demonstrates how two corpus-based learning and indexing algorithms, namely an n-gram and a seed-oriented approach, may be combined to bring out the best of both approaches. This system selects sentences from an input text to construct a highly compressed...

Noun-Phrase Analysis in Unrestricted Text for Information Retrieval

Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe...

Publication date: 1992